AITopics | term frequency

Collaborating Authors

term frequency

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SCADE: Scalable Framework for Anomaly Detection in High-Performance System

Vinay, Vaishali, Mangal, Anjali

arXiv.org Artificial IntelligenceDec-9-2024

As command-line interfaces remain integral to high-performance computing environments, the risk of exploitation through stealthy and complex command-line abuse grows. Conventional security solutions struggle to detect these anomalies due to their context-specific nature, lack of labeled data, and the prevalence of sophisticated attacks like Living-off-the-Land (LOL). To address this gap, we introduce the Scalable Command-Line Anomaly Detection Engine (SCADE), a framework that combines global statistical models with local context-specific analysis for unsupervised anomaly detection. SCADE leverages novel statistical methods, including BM25 and Log Entropy, alongside dynamic thresholding to adaptively detect rare, malicious command-line patterns in low signal-to-noise ratio (SNR) environments. Experimental results show that SCADE achieves above 98% SNR in identifying anomalous behavior while minimizing false positives. Designed for scalability and precision, SCADE provides an innovative, metadata-enriched approach to anomaly detection, offering a robust solution for cybersecurity in high-computation environments. This work presents SCADE's architecture, detection methodology, and its potential for enhancing anomaly detection in enterprise systems. We argue that SCADE represents a significant advancement in unsupervised anomaly detection, offering a robust, adaptive framework for security analysts and researchers seeking to enhance detection accuracy in high-computation environments.

data mining, detection, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2412.04259

Country:

North America > United States > Washington > King County > Redmond (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(7 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Information Technology > Security & Privacy (1.00)
Commercial Services & Supplies > Security & Alarm Services (0.88)
Government > Military > Cyberwarfare (0.34)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Probing Ranking LLMs: Mechanistic Interpretability in Information Retrieval

Chowdhury, Tanya, Allan, James

arXiv.org Artificial IntelligenceOct-24-2024

Transformer networks, especially those with performance on par with GPT models, are renowned for their powerful feature extraction capabilities. However, the nature and correlation of these features with human-engineered ones remain unclear. In this study, we delve into the mechanistic workings of state-of-the-art, fine-tuning-based passage-reranking transformer networks. Our approach involves a probing-based, layer-by-layer analysis of neurons within ranking LLMs to identify individual or groups of known human-engineered and semantic features within the network's activations. We explore a wide range of features, including lexical, document structure, query-document interaction, advanced semantic, interaction-based, and LLM-specific features, to gain a deeper understanding of the underlying mechanisms that drive ranking decisions in LLMs. Our results reveal a set of features that are prominently represented in LLM activations, as well as others that are notably absent. Additionally, we observe distinct behaviors of LLMs when processing low versus high relevance queries and when encountering out-of-distribution query and document sets. By examining these features within activations, we aim to enhance the interpretability and performance of LLMs in ranking tasks. Our findings provide valuable insights for the development of more effective and transparent ranking models, with significant implications for the broader information retrieval community. All scripts and code necessary to replicate our findings are made available.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2410.18527

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
North America > United States > District of Columbia > Washington (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > Italy > Marche > Ancona Province > Ancona (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

When Less Is Not More: Large Language Models Normalize Less-Frequent Terms with Lower Accuracy

Hier, Daniel B., Do, Thanh Son, Obafemi-Ajayi, Tayo

arXiv.org Artificial IntelligenceSep-11-2024

Term normalization is the process of mapping a term from free text to a standardized concept and its machine-readable code in an ontology. Accurate normalization of terms that capture phenotypic differences between patients and diseases is critical to the success of precision medicine initiatives. A large language model (LLM), such as GPT-4o, can normalize terms to the Human Phenotype Ontology (HPO), but it may retrieve incorrect HPO IDs. Reported accuracy rates for LLMs on these tasks may be inflated due to imbalanced test datasets skewed towards high-frequency terms. In our study, using a comprehensive dataset of 268,776 phenotype annotations for 12,655 diseases from the HPO, GPT-4o achieved an accuracy of 13.1% in normalizing 11,225 unique terms. However, the accuracy was unevenly distributed, with higher-frequency and shorter terms normalized more accurately than lower-frequency and longer terms. Feature importance analysis, using SHAP and permutation methods, identified low-term frequency as the most significant predictor of normalization errors. These findings suggest that training and evaluation datasets for LLM-based term normalization should balance low- and high-frequency terms to improve model performance, particularly for infrequent terms critical to precision medicine.

accuracy, frequency, term frequency, (12 more...)

arXiv.org Artificial Intelligence

2409.13746

Country:

North America > United States > Missouri > Greene County > Springfield (0.05)
North America > United States > Missouri > Phelps County > Rolla (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

How Lexical is Bilingual Lexicon Induction?

Kohli, Harsh, Feng, Helian, Dronen, Nicholas, McCarter, Calvin, Moeini, Sina, Kebarighotbi, Ali

arXiv.org Artificial IntelligenceApr-5-2024

In contemporary machine learning approaches to bilingual lexicon induction (BLI), a model learns a mapping between the embedding spaces of a language pair. Recently, retrieve-and-rank approach to BLI has achieved state of the art results on the task. However, the problem remains challenging in low-resource settings, due to the paucity of data. The task is complicated by factors such as lexical variation across languages. We argue that the incorporation of additional lexical information into the recent retrieve-and-rank approach should improve lexicon induction. We demonstrate the efficacy of our proposed approach on XLING, improving over the previous state of the art by an average of 2\% across all language pairs.

computational linguistic, frequency, proceedings, (15 more...)

arXiv.org Artificial Intelligence

2404.04221

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New York > New York County > New York City (0.05)
Europe > Italy > Tuscany > Florence (0.04)
(11 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)

Add feedback

Analysing the Impact of Removing Infrequent Words on Topic Quality in LDA Models

Bystrov, Victor, Naboka-Krell, Viktoriia, Staszewska-Bystrova, Anna, Winker, Peter

arXiv.org Artificial IntelligenceNov-24-2023

The use of topic modelling techniques, especially Latent Dirichlet Allocation (LDA) introduced by Blei et al. (2003), is growing fast. The methods find application in a broad variety of domains. In text-as-data applications, LDA enables the analysis of large collections of text in an unsupervised manner by uncovering latent structures behind the data. Given this increasing use of LDA as a standard tool for empirical analysis, also the interest in details of the method and, in particular, in parameter settings for its implementation is rising. Thus, since the introduction of the LDA approach in 2003 by Blei et al., different methodological components of LDA have already been studied in more detail as, for example, the choice of the number of topics (Cao et al., 2009; Mimno et al., 2011; Lewis and Grossetti, 2022; Bystrov et al., 2022a), hyper-parameter settings (Wallach et al., 2009), model design (e.g.

cut-off value, frequency, vocabulary pruning, (16 more...)

arXiv.org Artificial Intelligence

2311.14505

Country:

Europe > Poland > Łódź Province > Łódź (0.05)
Europe > Germany (0.04)
Asia > Middle East > Jordan (0.04)
(4 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Government (0.68)
Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.92)

Add feedback

Feature Engineering in Learning-to-Rank for Community Question Answering Task

Sajid, Nafis, Hasan, Md Rashidul, Ibrahim, Muhammad

arXiv.org Artificial IntelligenceSep-14-2023

Community question answering (CQA) forums are Internet-based platforms where users ask questions about a topic and other expert users try to provide solutions. Many CQA forums such as Quora, Stackoverflow, Yahoo!Answer, StackExchange exist with a lot of user-generated data. These data are leveraged in automated CQA ranking systems where similar questions (and answers) are presented in response to the query of the user. In this work, we empirically investigate a few aspects of this domain. Firstly, in addition to traditional features like TF-IDF, BM25 etc., we introduce a BERT-based feature that captures the semantic similarity between the question and answer. Secondly, most of the existing research works have focused on features extracted only from the question part; features extracted from answers have not been explored extensively. We combine both types of features in a linear fashion. Thirdly, using our proposed concepts, we conduct an empirical investigation with different rank-learning algorithms, some of which have not been used so far in CQA domain. On three standard CQA datasets, our proposed framework achieves state-of-the-art performance. We also analyze importance of the features we use in our investigation. This work is expected to guide the practitioners to select a better set of features for the CQA retrieval task.

algorithm, dataset, frequency, (16 more...)

arXiv.org Artificial Intelligence

2309.0761

Country:

Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
North America > United States (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.90)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)

Add feedback

MPTopic: Improving topic modeling via Masked Permuted pre-training

Zhang, Xinche, milios, Evangelos

arXiv.org Artificial IntelligenceSep-2-2023

Topic modeling is pivotal in discerning hidden semantic structures within texts, thereby generating meaningful descriptive keywords. While innovative techniques like BERTopic and Top2Vec have recently emerged in the forefront, they manifest certain limitations. Our analysis indicates that these methods might not prioritize the refinement of their clustering mechanism, potentially compromising the quality of derived topic clusters. To illustrate, Top2Vec designates the centroids of clustering results to represent topics, whereas BERTopic harnesses C-TF-IDF for its topic extraction.In response to these challenges, we introduce "TF-RDF" (Term Frequency - Relative Document Frequency), a distinctive approach to assess the relevance of terms within a document. Building on the strengths of TF-RDF, we present MPTopic, a clustering algorithm intrinsically driven by the insights of TF-RDF. Through comprehensive evaluation, it is evident that the topic keywords identified with the synergy of MPTopic and TF-RDF outperform those extracted by both BERTopic and Top2Vec.

algorithm, hdbscan, mptopic, (14 more...)

arXiv.org Artificial Intelligence

2309.01015

Country: North America > Canada > Nova Scotia > Halifax Regional Municipality > Halifax (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

A Novel Dual of Shannon Information and Weighting Scheme

Zhang, Arthur Jun

arXiv.org Artificial IntelligenceApr-25-2023

Shannon Information theory has achieved great success in not only communication technology where it was originally developed for but also many other science and engineering fields such as machine learning and artificial intelligence. Inspired by the famous weighting scheme TF-IDF, we discovered that information entropy has a natural dual. We complement the classical Shannon information theory by proposing a novel quantity, namely troenpy. Troenpy measures the certainty, commonness and similarity of the underlying distribution. To demonstrate its usefulness, we propose a troenpy based weighting scheme for document with class labels, namely positive class frequency (PCF). On a collection of public datasets we show the PCF based weighting scheme outperforms the classical TF-IDF and a popular Optimal Transportation based word moving distance algorithm in a kNN setting. We further developed a new odds-ratio type feature, namely Expected Class Information Bias(ECIB), which can be regarded as the expected odds ratio of the information quantity entropy and troenpy. In the experiments we observe that including the new ECIB features and simple binary term features in a simple logistic regression model can further significantly improve the performance. The simple new weighting scheme and ECIB features are very effective and can be computed with linear order complexity.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2304.12814

Country:

North America > United States > Massachusetts > Middlesex County > Waltham (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report > Experimental Study (0.36)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.88)

Add feedback

Understanding TF-IDF in NLP: A Comprehensive Guide

#artificialintelligenceMar-21-2023, 06:10:58 GMT

Natural Language Processing (NLP) is an area of computer science that focuses on the interaction between human language and computers. One of the fundamental tasks of NLP is to extract relevant information from large volumes of unstructured data. In this article, we will explore one of the most popular techniques used in NLP called TF-IDF. TF-IDF is a numerical statistic that reflects the importance of a word in a document. It is commonly used in NLP to represent the relevance of a term to a document or a corpus of documents.

corpus, frequency, tf-idf, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

Theory Behind the Basics of NLP - Analytics Vidhya

#artificialintelligenceAug-2-2022, 13:30:46 GMT

This article was published as a part of the Data Science Blogathon. Natural Language Processing (NLP) can help you to understand any text's sentiments. This is helpful for people to understand the emotions and the type of text they are looking over. Negative and Positive comments can be easily differentiated. NLP wanted to make machines understand the text or comment the same way humans can.

corpus, frequency, vocabulary, (15 more...)

#artificialintelligence

Country: Europe > Holy See > Vatican City (0.05)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.31)

Add feedback